
import Head from 'next/head'

<Head>
  <script>
    {
      `(function() {
         var _hmt = _hmt || [];
(function() {
  var hm = document.createElement("script");
  hm.src = "https://hm.baidu.com/hm.js?e60fb290e204e04c5cb6f79b0ac1e697";
  var s = document.getElementsByTagName("script")[0]; 
  s.parentNode.insertBefore(hm, s);
})();
       })();`
    }
  </script>
</Head>

![LangChain](https://pica.zhimg.com/50/v2-56e8bbb52aa271012541c1fe1ceb11a2_r.gif)






问题解答  Question Answering
 [#](#question-answering "Permalink to this headline")
===========================================================================


本文档涵盖了如何评估一般的问题回答问题。在这种情况下，您有一个包含一个问题及其相应的基本事实答案的示例，并且您希望测量语言模型在回答这些问题时的表现如何。



设置 Setup
 [#](#setup "Permalink to this headline")
-------------------------------------------------

出于演示的目的，我们将只评估一个简单的问答系统，该系统只评估模型的内部知识。


请参阅其他笔记本中的示例，其中它评估了模型在回答问题时如何处理模型训练中不存在的数据。







```python
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain
from langchain.llms import OpenAI

```










```python
prompt = PromptTemplate(template="Question: {question}\nAnswer:", input_variables=["question"])

```










```python
llm = OpenAI(model_name="text-davinci-003", temperature=0)
chain = LLMChain(llm=llm, prompt=prompt)

```








示例 Examples
 [#](#examples "Permalink to this headline")
-------------------------------------------------------


为此，我们将只使用两个简单的硬编码示例，但请参阅其他笔记本以了解如何获取和/或生成这些示例的提示。





```python
examples = [
    {
        "question": "Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?",
        "answer": "11"
    },
    {
        "question": 'Is the following sentence plausible? "Joao Moutinho caught the screen pass in the NFC championship."',
        "answer": "No"
    }
]

```








预测 Predictions
 [#](#predictions "Permalink to this headline")
-------------------------------------------------------------



我们现在可以对这些问题作出预测并加以检验。
 







```python
predictions = chain.apply(examples)

```










```python
predictions

```








```python
[{'text': ' 11 tennis balls'},
 {'text': ' No, this sentence is not plausible. Joao Moutinho is a professional soccer player, not an American football player, so it is not likely that he would be catching a screen pass in the NFC championship.'}]

```








评估 Evaluation
 [#](#evaluation "Permalink to this headline")
-----------------------------------------------------------





我们可以看到，如果我们试图只对答案（  `11`和 `No`）进行精确匹配，它们将不匹配语言模型的答案。然而，在语义上，语言模型在两种情况下都是正确的。

为了解释这一点，我们可以使用语言模型本身来评估答案。





```python
from langchain.evaluation.qa import QAEvalChain

```










```python
llm = OpenAI(temperature=0)
eval_chain = QAEvalChain.from_llm(llm)
graded_outputs = eval_chain.evaluate(examples, predictions, question_key="question", prediction_key="text")

```










```python
for i, eg in enumerate(examples):
    print(f"Example {i}:")
    print("Question: " + eg['question'])
    print("Real Answer: " + eg['answer'])
    print("Predicted Answer: " + predictions[i]['text'])
    print("Predicted Grade: " + graded_outputs[i]['text'])
    print()

```








```python
Example 0:
Question: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?
Real Answer: 11
Predicted Answer:  11 tennis balls
Predicted Grade:  CORRECT

Example 1:
Question: Is the following sentence plausible? "Joao Moutinho caught the screen pass in the NFC championship."
Real Answer: No
Predicted Answer:  No, this sentence is not plausible. Joao Moutinho is a professional soccer player, not an American football player, so it is not likely that he would be catching a screen pass in the NFC championship.
Predicted Grade:  CORRECT

```








自定义提示  Customize Prompt
 [#](#customize-prompt "Permalink to this headline")
-----------------------------------------------------------------------

您还可以自定义使用的提示。下面是一个使用0到10的分数提示它的示例。自定义提示符需要3个输入变量：“查询”、“答案”和“结果”。其中“查询”是问题，“答案”是基本事实答案，并且“结果”是预测的答案。





```python
from langchain.prompts.prompt import PromptTemplate

_PROMPT_TEMPLATE = """You are an expert professor specialized in grading students' answers to questions.
You are grading the following question:
{query}
Here is the real answer:
{answer}
You are grading the following predicted answer:
{result}
What grade do you give from 0 to 10, where 0 is the lowest (very low similarity) and 10 is the highest (very high similarity)?
"""

PROMPT = PromptTemplate(input_variables=["query", "answer", "result"], template=_PROMPT_TEMPLATE)

```










```python
evalchain = QAEvalChain.from_llm(llm=llm,prompt=PROMPT)
evalchain.evaluate(examples, predictions, question_key="question", answer_key="answer", prediction_key="text")

```








无地面实况的评估  Evaluation without Ground Truth
 [#](#evaluation-without-ground-truth "Permalink to this headline")
-----------------------------------------------------------------------------------------------------



在没有地面事实的情况下评估问答系统是可能的。您需要一个 `"context"` 输入，反映LLM用于回答问题的信息。该上下文可以通过任何检索系统获得。

下面是它如何工作的一个例子：





```python
context_examples = [
    {
        "question": "How old am I?",
        "context": "I am 30 years old. I live in New York and take the train to work everyday.",
    },
    {
        "question": 'Who won the NFC championship game in 2023?"',
        "context": "NFC Championship Game 2023: Philadelphia Eagles 31, San Francisco 49ers 7"
    }
]
QA_PROMPT = "Answer the question based on the context\nContext:{context}\nQuestion:{question}\nAnswer:"
template = PromptTemplate(input_variables=["context", "question"], template=QA_PROMPT)
qa_chain = LLMChain(llm=llm, prompt=template)
predictions = qa_chain.apply(context_examples)

```










```python
predictions

```








```python
[{'text': 'You are 30 years old.'},
 {'text': ' The Philadelphia Eagles won the NFC championship game in 2023.'}]

```










```python
from langchain.evaluation.qa import ContextQAEvalChain
eval_chain = ContextQAEvalChain.from_llm(llm)
graded_outputs = eval_chain.evaluate(context_examples, predictions, question_key="question", prediction_key="text")

```










```python
graded_outputs

```








```python
[{'text': ' CORRECT'}, {'text': ' CORRECT'}]

```








与其他评估指标比较  Comparing to other evaluation metrics
 [#](#comparing-to-other-evaluation-metrics "Permalink to this headline")
-----------------------------------------------------------------------------------------------------------------


 
我们可以将我们得到的评估结果与其他常见的评估指标进行比较。

为此，让我们从HuggingFace的 `evaluate`包中加载一些评估指标。






```python
# Some data munging to get the examples in the right format
for i, eg in enumerate(examples):
    eg['id'] = str(i)
    eg['answers'] = {"text": [eg['answer']], "answer_start": [0]}
    predictions[i]['id'] = str(i)
    predictions[i]['prediction_text'] = predictions[i]['text']

for p in predictions:
    del p['text']

new_examples = examples.copy()
for eg in new_examples:
    del eg ['question']
    del eg['answer']

```










```python
from evaluate import load
squad_metric = load("squad")
results = squad_metric.compute(
    references=new_examples,
    predictions=predictions,
)

```










```python
results

```








```python
{'exact_match': 0.0, 'f1': 28.125}

```








